LLM benchmarking framework with SystemDS, vLLM & OpenAI backends - LDE Project#2431

Open
kubraaksux wants to merge 76 commits into apache:main from kubraaksux:llm-benchmark
Conversation

kubraaksux commented Feb 16, 2026

Adds the llmPredict DML built-in and a benchmarking framework that evaluates it against the OpenAI API and vLLM across 5 workloads. Developed as part of the LDE course. (Supersedes the closed #2430.)

What this PR adds

Java (llmPredict built-in):

  • LlmPredictCPInstruction.java — dedicated CP instruction class, extracted from ParameterizedBuiltinCPInstruction
  • Structured error handling: ConnectException, SocketTimeoutException, MalformedURLException, HTTP non-200 with error body readback
  • Negative tests: testServerUnreachable and testInvalidUrl with message assertions
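The error-handling pattern above (surface connection failures with clear messages, read the error body back on non-200 responses) can be sketched in Python; the actual instruction is Java using HttpURLConnection, and `call_llm_server` here is a hypothetical helper, not code from this PR:

```python
import urllib.request
import urllib.error


def call_llm_server(url: str, payload: bytes, timeout_s: float = 30.0) -> str:
    """POST a request to an inference server, surfacing structured errors."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        # Non-200: read the error body back so the user sees the
        # server's own message, not just a status code.
        body = e.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"Server returned HTTP {e.code}: {body}") from e
    except urllib.error.URLError as e:
        # Covers connection refused, DNS failure, and connect timeouts.
        raise RuntimeError(
            f"Could not reach inference server at {url}: {e.reason}"
        ) from e
```

The negative tests in the PR assert on exactly this kind of message, verifying that a user pointing llmPredict at an unreachable server gets an actionable error rather than a bare stack trace.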

Python (benchmark framework in scripts/staging/llm-bench/):

  • Runner with OpenAI, vLLM, and SystemDS backends
  • 5 workloads: math (GSM8K), reasoning (BoolQ), summarization (XSum), JSON extraction (CoNLL-2003), embeddings (STS-B)
  • Evaluation, aggregation, and HTML report generation
  • 131 unit tests covering accuracy checks, extraction logic, runner validation
  • License headers on all files

Key results (n=50 per workload)

| Metric | OpenAI gpt-4.1-mini | vLLM Qwen 3B (H100) | SystemDS Qwen 3B (H100) |
|---|---|---|---|
| Accuracy (math) | 96% | 68% | 68% |
| Accuracy (reasoning) | 88% | 64% | 60% |
| Accuracy (summarization) | 86% | 62% | 50% |
| Accuracy (json_extraction) | 61% (46 samples) | 52% | 52% |
| Accuracy (embeddings) | 88% | 90% | 90% |
| Latency (math, ms) | 4577 | 1911 | 1924 |
| JMLC overhead | n/a | n/a | <2% vs vLLM |

SystemDS matches vLLM accuracy on 3/5 workloads exactly (math, json_extraction, embeddings — same model, same server). Small differences on reasoning/summarization are run-to-run variation (separate runs 13h apart). The JMLC pipeline (Py4J -> DML compilation -> Java HTTP) adds <2% latency.

Cost comparison

| Workload | OpenAI API Cost | vLLM Compute Cost | SystemDS Compute Cost |
|---|---|---|---|
| math | $0.0223 | $0.0559 | $0.0563 |
| reasoning | $0.0100 | $0.0307 | $0.0323 |
| summarization | $0.0075 | $0.0105 | $0.0107 |
| json_extraction | $0.0056 | $0.0152 | $0.0155 |
| embeddings | $0.0019 | $0.0014 | $0.0014 |
| Total (5 workloads) | $0.047 | $0.114 | $0.116 |

How costs are computed:

  • OpenAI: Per-token API pricing (gpt-4.1-mini: $0.40/M input, $1.60/M output). Token counts from API usage response.
  • vLLM / SystemDS: Estimated from hardware ownership. Formula: electricity = (350 W / 1000) * (wall_s / 3600) * $0.30/kWh; amortization = ($30,000 / 15,000 h) * (wall_s / 3600); total = electricity + amortization. Wall time wall_s = latency_ms_mean * n_samples / 1000 (seconds).

Hardware assumptions (NVIDIA H100 PCIe, the actual benchmark GPU):

  • GPU power draw: 350W (TDP)
  • Electricity: $0.30/kWh (EU average)
  • Hardware: $30,000 purchase price, 15,000h useful lifetime (~5 yr at 8 hr/day)
  • Amortization rate: $2.00/hr

Why local GPU appears more expensive: This benchmark runs only 250 sequential queries totaling ~3 min of inference. The H100 amortizes at $2.00/hr regardless of utilization. OpenAI only charges for tokens used, which wins at low volume. At full GPU throughput (~21 req/s on embeddings), amortized cost drops to ~$0.00003/query vs OpenAI's ~$0.0004/query — making owned hardware ~13x cheaper at scale.
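The table numbers are reproducible from this formula. A minimal sketch (the function name is illustrative; the defaults mirror the --power-draw-w, --electricity-rate, and --hardware-cost flags mentioned later in the commit log):

```python
def compute_cost_usd(latency_ms_mean: float, n_samples: int,
                     power_draw_w: float = 350.0,     # H100 PCIe TDP
                     electricity_rate: float = 0.30,  # $/kWh, EU average
                     hardware_cost: float = 30_000.0,
                     lifetime_h: float = 15_000.0) -> float:
    """Electricity + hardware amortization for a sequential benchmark run."""
    wall_s = latency_ms_mean * n_samples / 1000.0   # total wall time, seconds
    wall_h = wall_s / 3600.0
    electricity = (power_draw_w / 1000.0) * wall_h * electricity_rate
    amortization = (hardware_cost / lifetime_h) * wall_h  # $2.00/hr here
    return electricity + amortization

# Math workload rows from the cost table above:
round(compute_cost_usd(1911, 50), 4)  # vLLM:     -> 0.0559
round(compute_cost_usd(1924, 50), 4)  # SystemDS: -> 0.0563
```

Note that amortization dominates: at $2.00/hr it is roughly 19x the electricity cost, which is why idle-time utilization drives the break-even point against the API.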

Full documentation

scripts/staging/llm-bench/README.md — full methodology, all results tables, evaluation criteria, project structure, and setup instructions.

Generic LLM benchmark suite for evaluating inference performance
across different backends (vLLM, Ollama, OpenAI, MLX).

Features:
- Multiple workload categories: math (GSM8K), reasoning (BoolQ, LogiQA),
  summarization (XSum, CNN/DM), JSON extraction
- Pluggable backend architecture for different inference engines
- Performance metrics: latency, throughput, memory usage
- Accuracy evaluation per workload type
- HTML report generation

This framework can be used to evaluate SystemDS LLM inference
components once they are developed.
- Connection.java: Changed loadModel(modelName) to loadModel(modelName, workerScriptPath)
- Connection.java: Removed findPythonScript() method
- LLMCallback.java: Added Javadoc for generate() method
- JMLCLLMInferenceTest.java: Updated to pass script path to loadModel()
- Connection.java: Auto-find available ports for Py4J communication
- Connection.java: Add loadModel() overload for manual port override
- Connection.java: Use destroyForcibly() with waitFor() for clean shutdown
- llm_worker.py: Accept python_port as command line argument
Move worker script from src/main/python/systemds/ to src/main/python/
to avoid shadowing Python stdlib operator module.
- Add generateWithTokenCount() returning JSON with input/output token counts
- Update generateBatchWithMetrics() to include input_tokens and output_tokens columns
- Add CUDA auto-detection with device_map=auto for multi-GPU support in llm_worker.py
- Check Python process liveness during startup instead of blind 60s timeout
- Fix duplicate accuracy computation in runner.py
- Add --model flag and error handling to run_all_benchmarks.sh
- Fix ttft_stats and timing_stats logic bugs
- Extract shared helpers into scripts/utils.py
- Add HuggingFace download fallback to all loaders
- Fix reasoning accuracy false positives with word-boundary regex
- Pin dependency versions in requirements.txt
- Clean up dead code and unify config keys across backends
- Fix README clone URL and repo structure
- Use real token counts from Ollama/vLLM APIs, omit when unavailable
- Correct TTFT and cost estimates
- Add --gpu-hour-cost and --gpu-count flags for server benchmarks
- 121 unit tests for all accuracy checkers, loaders, and metrics
- ROUGE-1/2/L scoring for summarization (replaces quality-gate heuristic)
- Concurrent request benchmarking with --concurrency flag
- GPU profiling via pynvml
- Real TTFT for MLX backend via stream_generate
- Backend factory pattern and config validation
- Proper logging across all components
- Updated configs to n_samples=50
Replace declare -A (bash 4+ only) with a case function for
default model lookup. macOS ships with bash 3.x.
- New embeddings workload using STS-Benchmark from HuggingFace
- Model rates semantic similarity between sentence pairs (0-5 scale)
- 21 new tests for score extraction, accuracy check, sample loading
- Total: 142 tests passing across 5 workloads
- Add electricity + hardware amortization cost estimation to runner
  (--power-draw-w, --electricity-rate, --hardware-cost flags)
- Fix aggregate.py cost key mismatch (api_cost_usd vs cost_total_usd)
- Add compute cost columns to CSV output and HTML report
- Update README with cost model documentation and embeddings workload
Include all 10 benchmark runs (5 OpenAI + 5 Ollama, 50 samples each)
with metrics, samples, configs, HTML report, and aggregated CSV.
- 5 workloads x 2 models on NVIDIA H100 PCIe via vLLM
- Mistral-7B-Instruct-v0.3: strong reasoning (68%), fast embeddings (129ms)
- Qwen2.5-3B-Instruct: best embeddings accuracy (90%), 75ms latency
- Compute costs reflect H100 electricity (350W) + hardware amortization
- Regenerated summary.csv and benchmark_report.html with all 20 runs
Integrate SystemDS as a benchmark backend using the JMLC API. All prompts
are processed through PreparedScript.generateBatchWithMetrics() which
returns results in a typed FrameBlock with per-prompt timing and token
metrics. Benchmark results for 4 workloads with distilgpt2 on H100.
Run the embeddings (semantic similarity) workload with SystemDS JMLC,
bringing SystemDS to 5 workloads matching all other backends.
Run all 5 workloads with Qwen/Qwen2.5-3B-Instruct through the SystemDS
JMLC backend, replacing the distilgpt2 toy model. This enables a direct
apples-to-apples comparison with vLLM Qwen 3B: same model, different
serving path (raw HuggingFace via JMLC vs optimized vLLM inference).
Replace distilgpt2 toy model with same models used by vLLM backends:
- SystemDS + Qwen 3B (5 workloads) vs vLLM + Qwen 3B
- SystemDS + Mistral 7B (5 workloads) vs vLLM + Mistral 7B
All runs include compute cost flags (350W, $0.30/kWh, $30k hardware).
Increase JMLC worker timeout from 60s to 300s for larger models.
# Conflicts:
#	.gitignore
#	src/test/java/org/apache/sysds/test/functions/jmlc/JMLCLLMInferenceTest.java
- Use proper imports instead of inline fully-qualified class names
- Add try-with-resources for HTTP streams to prevent resource leaks
- Add connect/read timeouts to HTTP calls
- Add lineage tracing support for llmPredict
- Add checkInvalidParameters validation in parser
- Remove leftover Py4J code from Connection/PreparedScript
- Delete LLMCallback.java
- Remove .claude/.env/meeting_notes from .gitignore
- Trim verbose docstrings
Supports parallel HTTP calls to the inference server via
ExecutorService. Default concurrency=1 keeps sequential behavior.
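On the Java side this is an ExecutorService; the equivalent shape in Python, as a sketch (`run_prompts` and `send_fn` are hypothetical names, not the PR's actual API):

```python
from concurrent.futures import ThreadPoolExecutor


def run_prompts(prompts, send_fn, concurrency=1):
    """Dispatch prompts to the inference server, preserving input order.

    concurrency=1 keeps the original sequential behavior; higher values
    issue parallel HTTP calls, mirroring the ExecutorService on the Java
    side. pool.map returns results in submission order regardless of
    completion order.
    """
    if concurrency <= 1:
        return [send_fn(p) for p in prompts]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_fn, prompts))
```

Keeping concurrency=1 as the default means existing single-threaded benchmark runs are byte-for-byte unaffected by the feature.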
# Conflicts:
#	src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
#	src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
- Delete Py4J-based benchmark results (will re-run with llmPredict)
- Remove license header from test (Matthias will add)
- Clarify llm_server.py docstring
JMLC requires the LHS variable name in read() assignments to match
the input name registered in prepareScript(). Changed X/R to
prompts/results so RewriteRemovePersistentReadWrite correctly
converts persistent reads to transient reads.
Correct SystemDS concurrency scaling numbers to match actual metrics.json
data (throughput-based instead of incorrect per-prompt estimates). Update
latency table, concurrency scaling table, run_all_benchmarks.sh for
automatic c=1/c=4 runs, and regenerate HTML report.
- Remove broken base SystemDS result directories (0% accuracy, 0ms latency
  from failed earlier run)
- Remove fabricated cost per query table (benchmarks were run without
  --power-draw-w/--hardware-cost flags, all cost data was $0)
- Fix accuracy claim: c=1 matches vLLM exactly, c=4 shows minor variation
  on reasoning (64% vs 60%) and summarization (62% vs 50%) due to vLLM
  batching non-determinism
- Add SystemDS c=1 and c=4 columns to accuracy tables
- Fix report.py to show c=1 and c=4 as separate backends instead of
  merging them into one "systemds (Qwen2.5-3B)" column
- Fix floating point truncation bug in accuracy tooltip (int(50*0.58)=28,
  now uses accuracy_count from metrics.json directly)
- Replace stale "Py4J bridge cost" references with "JMLC overhead"
- Regenerate HTML report and summary CSV
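The tooltip truncation bug above is ordinary binary floating point: 0.58 has no exact double representation, so 50 * 0.58 lands just below 29 and int() truncates it. A quick demonstration:

```python
product = 50 * 0.58
print(int(product))    # 28 -- the old tooltip's truncation bug
print(round(product))  # 29 -- rounding would mask it, but the actual fix
                       #       reads accuracy_count from metrics.json
                       #       instead of recomputing from the percentage
```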
…usions

Major changes:
- Restructure README: move SystemDS architecture section before results,
  add compilation pipeline files, add JMLC code example
- Add measurement methodology note: vLLM uses Python streaming HTTP while
  SystemDS uses Java non-streaming HttpURLConnection, making per-prompt
  latency not directly comparable across backends
- Rewrite conclusions to be evidence-based: llmPredict correctness proven
  by accuracy match, concurrency scaling quantified, model-vs-backend
  distinction made explicit, latency caveat explained
- Remove MLX from supported backends table (not benchmarked), mark as
  "not benchmarked" in repo structure
- Remove fabricated OpenAI cost claim ($0.02-0.03)
- Remove "All backends overview" table (redundant with other tables)
- Simplify concurrency scaling table to throughput only (remove
  misleading effective latency columns)
- Put accuracy table first (apples-to-apples metric) before latency
…and evaluation methodology

- Fix bold-pattern regex in math number extraction: allow arbitrary text
  between number and closing ** (fixes 3 false negatives in OpenAI math,
  44/50 -> 47/50)
- Re-score all 30 result sets from raw samples.jsonl (only OpenAI math changed)
- Add complete cost comparison table with all backends including OpenAI
  API cost + local compute cost
- Add cost calculation formula with hardware assumptions
- Add evaluation methodology section explaining per-workload accuracy criteria
- Add cross-backend comparisons (SystemDS vs vLLM, OpenAI vs local,
  Qwen 3B vs Mistral 7B, Ollama analysis)
- Fix PR description scope: this is the benchmark framework PR, not llmPredict
- Fix hardware claims: Ollama/OpenAI ran on MacBook, not H100
- Add model names to SystemDS column headers (SystemDS Qwen 3B c=1/c=4)
- Explain Mistral's low math results (verbose output confuses extractor)
- Regenerate HTML report
The previous explanation attributed all failures to the number extractor.
Analysis of raw samples shows 20 of 31 incorrect answers were genuinely
wrong (wrong formulas, negative results, refusing to solve), while only
10 had the correct answer present but extracted the wrong number.
…ests, and license headers

- Extract llmPredict logic from ParameterizedBuiltinCPInstruction into
  dedicated LlmPredictCPInstruction class for better separation of concerns
- Add structured error handling: ConnectException, SocketTimeoutException,
  MalformedURLException, HTTP non-200 responses with error body readback
- Add conn.disconnect() in finally block for proper cleanup
- Add negative tests (testServerUnreachable, testInvalidUrl) with message
  assertions verifying error messages reach the user
- Add Apache license headers to llm_server.py and llm_worker.py (CI fix)
- Rewrite benchmark framework with SystemDS JMLC backend, strict HuggingFace
  dataset loaders, and run_all_benchmarks.sh orchestration script
- Fresh benchmark results: vLLM and SystemDS with Qwen2.5-3B on H100,
  5 workloads (math, reasoning, summarization, json_extraction, embeddings)
- Run OpenAI gpt-4.1-mini on all 5 workloads (math 96%, reasoning 88%,
  summarization 86%, json_extraction 61%, embeddings 88%)
- Update README with comprehensive results: OpenAI, vLLM Qwen 3B, and
  SystemDS Qwen 3B side-by-side accuracy, latency, throughput, and cost
- Regenerate summary.csv and benchmark_report.html with 15 total runs
@kubraaksux kubraaksux changed the title LLM benchmarking framework with SystemDS & Ollama & VLLM Backends - LDE Project LLM benchmarking framework with SystemDS, vLLM & OpenAI backends - LDE Project Feb 27, 2026
- Remove unused backends: mlx_backend.py, ollama_backend.py
- Remove unused files: llm_worker.py, benchmark_report.html
- Add computed electricity and hardware amortization costs to
  vLLM and SystemDS metrics.json files (H100: 350W, $0.30/kWh,
  $30k hardware, 15k hour lifetime)
- Update aggregate.py cost_per_1m logic for local backends
- Clean stale ollama/mlx references from report.py, runner.py,
  run_all_benchmarks.sh, requirements.txt
- Add pynvml to requirements.txt (used for GPU profiling)
- Update README with cost comparison tables and methodology
- Regenerate summary.csv
- math: remove 'last number anywhere' and 'last sentence-ending number'
  fallbacks from extract_number_from_response (returns None if no
  explicit answer marker found)
- reasoning: remove 'last short standalone line' fallback from
  _extract_answer (returns None if no marker found)
- embeddings: reject out-of-range scores instead of clamping (6.0 now
  returns -1.0 instead of 5.0)
- summarization: remove silent fallback to unigram overlap when
  rouge-score not installed (rouge-score is a required dependency),
  remove unused _tokenize helper and re import
- openai: remove str(resp) fallback when resp.output_text fails (let
  the error propagate instead of silently returning response repr)
- Updated tests to match new strict behavior
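The strict-extraction policy above can be illustrated with a small sketch. Function names and the marker pattern are simplified stand-ins for the real extractors, which handle more answer formats:

```python
import re


def extract_final_number(response: str):
    """Return the number after an explicit answer marker, else None.

    Strict policy: no "last number anywhere" fallback. A response with
    no marker is scored as unanswered rather than guessed at.
    """
    m = re.search(
        r"(?:final answer|answer)\s*[:=]?\s*\**\s*(-?\d+(?:\.\d+)?)",
        response, re.IGNORECASE,
    )
    return float(m.group(1)) if m else None


def extract_similarity_score(response: str):
    """Extract a 0-5 similarity rating; out-of-range scores are rejected."""
    m = re.search(r"-?\d+(?:\.\d+)?", response)
    if not m:
        return -1.0
    score = float(m.group(0))
    # Reject rather than clamp: a model answering 6.0 is wrong, and
    # clamping it to 5.0 would silently award credit.
    return score if 0.0 <= score <= 5.0 else -1.0
```

The trade-off is deliberate: strict extraction can score a correct-but-unmarked answer as wrong, but it never scores a wrong answer as right, which keeps cross-backend accuracy comparisons honest.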